Evaluation of NLG: Some Analogies and Differences with Machine Translation and Reference Resolution
Abstract
This short paper first outlines an explanatory model that contrasts the evaluation of systems for which human language appears in their input with systems for which language appears in their output, or in both input and output. The paper then compares metrics for NLG evaluation with those applied to MT systems, and then with the case of reference resolution, which is the reverse task of generating referring expressions.

1 Challenges in NLG Evaluation

Defining shared-task evaluation campaigns (STECs) is often the key to making progress in a particular domain, thanks to the convergence of several research teams. However, defining a STEC requires acceptable agreement, among a community of researchers, on the relevance of the selected problem to the domain, as well as on common evaluation metrics that indicate progress on the task. In the domain of Natural Language Generation (NLG), recent proposals have started to meet the challenge of STEC definition (Belz and Kilgarriff, 2006), a few years after a new metric for Machine Translation (MT) evaluation (Papineni et al., 2001) revived interest in common evaluations thanks to its low application cost, which in turn led to significant improvements in MT systems, especially statistical ones. An important question is therefore: how could NLG benefit from a similarly innovative metric, and how could such a metric be found?

This short paper offers an explanation of the difficulty of evaluating NLG systems based on a typology of natural language processing (NLP) systems, and draws from this typology some suggestions for NLG evaluation (Section 2). NLG evaluation is then compared to MT evaluation (Section 3). Finally, the focus is set on referring expressions (REs), which were used in the task proposed at the 2007 UCNLG+MT workshop, and which might help provide an indirect measure of NLG “quality” by combining the generation of REs with reference resolution (Section 4); a sketch of this round-trip idea appears below.

2 A Typology of NLP Systems and Its Relation to Evaluation

Some approaches to evaluation distinguish intrinsic from extrinsic methods (Sparck Jones and Galliers, 1996), i.e. methods that try to assess the “quality” of an output vs. methods that estimate its “utility” for a given task. Other approaches distinguish internal from external evaluation, and add evaluation in use (ISO/IEC, 2001): internal methods look at static properties of a system, while external ones assess its behaviour when it runs. These types of evaluation are not equally well suited to the various types of NLP systems.

A useful typology of NLP tasks can be based on the role of language in the input and/or output of a system (Popescu-Belis, 2007). One can distinguish systems that have language as input (type A, for ‘analysis’), systems that have language as output (type G, for ‘generation’), systems that combine the two (type AG), and systems that must interact with a human user to produce a result (type AGI, with I for ‘interactive’). Type A systems typically produce some form of […]
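To make the round-trip idea of Section 4 concrete, the following minimal Python sketch pairs an RE generator with a reference resolver: for each entity, the generator produces a referring expression (a type G task), the resolver maps it back to an entity (a type A task), and the fraction of entities recovered exactly serves as an indirect accuracy measure for the generator. The names generate_re, resolve_re, and round_trip_accuracy are hypothetical, introduced here for illustration rather than taken from the paper.

    from typing import Callable, Hashable, Sequence

    def round_trip_accuracy(
        entities: Sequence[Hashable],
        context: object,
        generate_re: Callable[[Hashable, object], str],
        resolve_re: Callable[[str, object], Hashable],
    ) -> float:
        """Indirect NLG evaluation: generate a referring expression for
        each entity, resolve it back, and count exact recoveries."""
        if not entities:
            return 0.0
        recovered = 0
        for entity in entities:
            expression = generate_re(entity, context)    # generation side (type G)
            candidate = resolve_re(expression, context)  # resolution side (type A)
            if candidate == entity:
                recovered += 1
        return recovered / len(entities)

Under this scheme, a generator whose expressions are systematically resolved back to the intended referents would score high without requiring human judgements, at the cost of depending on the quality of the resolver.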
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Publication date: 2007